By Andreas Troxler, June 2022
An abundant amount of information is available to insurance companies in the form of text. However, language data is unstructured, sometimes multilingual, and single words or phrases taken out of context can be highly ambiguous. With the help of transformer models, text data can be converted into structured data and then used as input to predictive models.
In Part I of this tutorial, you will discover the use of transformer models for text classification. Throughout this tutorial, the HuggingFace Transformers library will be used.
This notebook serves as a companion to the tutorial "Actuarial Applications of Natural Language Processing Using Transformers". The tutorial explains the underlying concepts, and this notebook illustrates the implementation. The tutorial, the dataset and the notebooks are available on GitHub.
After completing this tutorial, you will know:
Let’s get started.
This notebook is divided into seven parts; they are:
1.1 Prerequisites
A brief introduction to the HuggingFace ecosystem
2.1 Loading the data into a Dataset
Using transformers to extract features for classification or regression tasks
3.1 Extracting the encoded text ...
3.2 ... and using it in a classification model
3.3 Case study: use accident descriptions to predict the number of vehicles involved
Fine-tuning – improving the model
Understand prediction errors and interpret predictions
5.1. Case study: use accident descriptions to identify bodily injury
5.2. Investigate false positives and false negatives
5.3. Use Captum and transformers-interpret to interpret predictions
This notebook is computationally intensive. We recommend using a platform with GPU support.
We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).
Please note that the results may not be reproducible across platforms and versions.
Make sure the following files are available in the directory of the notebook:
tutorial_utils.py - a collection of utility functions used throughout this notebook, explained in Section 3.2
NHTSA_NMVCCS_extract.parquet.gzip - the data

This notebook will create the following subdirectories:

datasets - pre-processed datasets
models - trained Transformer models
results - figures and Excel files

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook.
In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.
# Notebook settings
# clear the namespace variables
from IPython import get_ipython
get_ipython().run_line_magic("reset", "-sf")
# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
The following libraries are required:
!pip install transformers
!pip install datasets
!pip install transformers_interpret
!pip install plotly
!pip install kaleido
!pip install pyyaml==5.4.1 ## https://github.com/yaml/pyyaml/issues/576
from datasets import Dataset, DatasetDict, load_from_disk
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, trainer_utils, AutoModelForMaskedLM,\
DataCollatorForLanguageModeling, AutoModelForSequenceClassification, pipeline
from transformers_interpret import SequenceClassificationExplainer
import torch
import pandas as pd
import numpy as np
from scipy.special import softmax
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
import plotly.express as px
from wordcloud import WordCloud
from tutorial_utils import extract_sequence_encoding, get_xy, dummy_classifier, logistic_regression_classifier, evaluate_classifier
In addition, we require openpyxl to enable export from Pandas to Excel.
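openpyxl is not part of the install list above; if it is missing on your platform, it can be installed in the same way (assuming a pip-based notebook environment):

```shell
!pip install openpyxl
```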
The data used throughout this tutorial is derived from data of a vehicle crash causation study performed in the United States from 2005 to 2007. The dataset has almost 7'000 records, each relating to one accident. For each case, a verbal description of the accident is available in English, which summarizes road and weather conditions, vehicles, drivers and passengers involved, preconditions, injury severities, etc. The same information is also encoded in tabular form, so that we can apply supervised learning techniques to train the NLP models and compare the information extracted from the verbal descriptions with the encoded data.
The original data consists of multiple tables. For this tutorial, we have aggregated it into a single dataset and added German translations of the English accident descriptions. The translations were generated using the new DeepL python API.
To explore the data, let's load it into a Pandas DataFrame and examine its shape, columns and data types:
df = pd.read_parquet("NHTSA_NMVCCS_extract.parquet.gzip")
print(f"shape of DataFrame: {df.shape}")
print(*list(zip(df.columns, df.dtypes)), sep="\n")
shape of DataFrame: (6949, 16)
('level_0', dtype('int64'))
('index', dtype('int64'))
('SCASEID', dtype('int64'))
('SUMMARY_EN', dtype('O'))
('SUMMARY_GE', dtype('O'))
('INJSEVA', dtype('int64'))
('NUMTOTV', dtype('int64'))
('WEATHER1', dtype('int64'))
('WEATHER2', dtype('int64'))
('WEATHER3', dtype('int64'))
('WEATHER4', dtype('int64'))
('WEATHER5', dtype('int64'))
('WEATHER6', dtype('int64'))
('WEATHER7', dtype('int64'))
('WEATHER8', dtype('int64'))
('INJSEVB', dtype('int64'))
The column SCASEID is a unique case identifier.
The columns SUMMARY_EN and SUMMARY_GE are strings representing the verbal descriptions of the accident
in English and German, respectively.
NUMTOTV is the number of vehicles involved in the case. Let's have a look at the distribution of this feature:
fig = px.bar(df["NUMTOTV"].value_counts().sort_index(), width=640)
fig.update_layout(title="number of cases by number of vehicles", xaxis_title="number of vehicles",
yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "num_vehicles"}})
Most cases involve two vehicles, and only very few accidents involve more than three vehicles.
Each of the columns WEATHER1 to WEATHER8 indicates the presence of a specific weather condition
(1: weather condition present, 9999: presence of weather condition unknown, 0 otherwise):
| column | meaning | count |
|---|---|---|
WEATHER1 |
cloudy | 1112 |
WEATHER2 |
snow | 114 |
WEATHER3 |
fog, smog, smoke | 28 |
WEATHER4 |
rain | 624 |
WEATHER5 |
sleet, hail (freezing drizzle or rain) | 25 |
WEATHER6 |
blowing snow | 38 |
WEATHER7 |
severe crosswinds | 20 |
WEATHER8 |
other | 25 |
These weather conditions are not mutually exclusive, i.e., more than one condition can be present in a single case. The frequency distribution looks as follows:
fig = px.bar(x=list(range(1, 9)), y=[(df[f"WEATHER{i}"] == 1).sum() for i in range(1, 9)], width=640)
fig.update_layout(title="number of cases by weather condition", xaxis_title="weather condition",
yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "weather"}})
The most frequently recorded weather conditions are "cloudy" (WEATHER1) and "rain" (WEATHER4).
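Since the weather flags are not mutually exclusive, it can be instructive to count how many conditions are flagged per case. A minimal sketch with synthetic flags (not the real data; the same logic would be applied to the full df):

```python
import pandas as pd

# Hypothetical mini-example (synthetic flags, not the real data): count how
# many weather conditions are flagged per case; a value of 1 marks the
# condition as present, while 0 and 9999 do not count.
df_demo = pd.DataFrame({"WEATHER1": [1, 0, 1],    # cloudy
                        "WEATHER4": [1, 1, 0]})   # rain
n_conditions = (df_demo == 1).sum(axis=1)
print(n_conditions.tolist())          # [2, 1, 1]: conditions per case
multi = int((n_conditions > 1).sum())
print(multi)                          # 1 case with more than one condition
```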
INJSEVA indicates the most serious sustained injury in the accident.
For instance, if one person was not injured, and another person suffered a non-incapacitating injury,
injury class 2 was assigned to the case.
Information on injury severity has been taken from police accident reports, which are not available in the data.
Unfortunately, this information does not necessarily align with the case description:
There are many cases for which the case description indicates the presence of an injury,
but INJSEVA does not, and vice versa.
For this reason, we manually created an additional column INJSEVB based on the case description,
to indicate the presence of a (possible) bodily injury.
The table below shows the distribution of number of cases by the two variables.
| INJSEVA | meaning | INJSEVB=0 | INJSEVB=1 | Total |
|---|---|---|---|---|
| 0 | O - No injury | 1'458 | 96 | 1'554 |
| 1 | C - Possible injury | 1'112 | 1'298 | 2'410 |
| 2 | B - Non-incapacitating injury | 729 | 945 | 1'674 |
| 3 | A - Incapacitating injury | 304 | 373 | 677 |
| 4 | K - Killed | 5 | 114 | 119 |
| 5 | U - Injury, severity unknown | 44 | 122 | 166 |
| 6 | Died prior to crash | 0 | 0 | 0 |
| 9 | Unknown if injured | 51 | 16 | 67 |
| 10 | No person in crash | 1 | 0 | 1 |
| 11 | No PAR (police accident report) obtained | 231 | 50 | 281 |
| Total | | 3'935 | 3'014 | 6'949 |
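A cross-tabulation of this kind can be produced with pd.crosstab; here is a toy example with made-up labels (the notebook itself would pass df["INJSEVA"] and df["INJSEVB"]):

```python
import pandas as pd

# Toy cross-tabulation with synthetic labels; margins adds the Total row/column.
demo = pd.DataFrame({"INJSEVA": [0, 0, 1, 1, 1],
                     "INJSEVB": [0, 1, 0, 1, 1]})
table = pd.crosstab(demo["INJSEVA"], demo["INJSEVB"],
                    margins=True, margins_name="Total")
print(table.loc["Total", "Total"])  # 5: the grand total equals the row count
```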
Now we turn to the verbal accident descriptions.
First, we examine the length of the English texts, SUMMARY_EN.
To this end, we split the texts into words, with blank spaces as separator,
and show a box plot of the text length by number of vehicles involved in the accident:
# statistics of summary length
df["words per case summary"] = df["SUMMARY_EN"].str.split().apply(len)
print(f"Overall number of words by case summary: min {df['words per case summary'].min()}, "
f"average {df['words per case summary'].mean():.0f}, max {df['words per case summary'].max()}")
fig = px.box(df, x="NUMTOTV", y="words per case summary", width=640)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "text_length"}})
Overall number of words by case summary: min 60, average 419, max 1248
Not surprisingly, the length of the descriptions correlates with the number of vehicles involved.
The average length is above 400 words. As we will see later, this poses a challenge for the NLP models used in this notebook, because they are limited to inputs of up to 512 so-called "tokens" (vocabulary items). Since a single word may be tokenized into more than one token, some accident descriptions will be truncated.
Let's examine one of the English texts and its German translation:
display(HTML(df.loc[0, "SUMMARY_EN"]))
display(HTML(df.loc[0, "SUMMARY_GE"]))
To get an impression of the most frequent words, we generate a simple word cloud from all English case descriptions. By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.), which are the most common words and do not add much information to the text.
text = df["SUMMARY_EN"].str.cat(sep=" ")
# Create and generate a word cloud image:
word_cloud = WordCloud(max_words=100, background_color="white").generate(text)
# Display the generated image:
fig = px.imshow(word_cloud, width=640)
fig.update_layout(xaxis_showticklabels=False, yaxis_showticklabels=False)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "word_cloud"}})
This tutorial uses NLP models provided by HuggingFace.
HuggingFace is a community that builds, trains and deploys state-of-the-art models for natural language processing, audio, computer vision, etc. HuggingFace's model hub provides thousands of pre-trained models for these applications. The Transformers library offers functionality to quickly download and use those pre-trained models on a given input, fine-tune them on your own datasets and then share them with the community. The library is backed by the three most popular deep learning libraries: Jax, PyTorch and TensorFlow.
In this notebook, the following elements of the HuggingFace ecosystem will be used:
In the next sections we will briefly explore the first three components in turn. The trainer functionality will be used in Section 4 of this notebook.
Datasets is a library for easily accessing and sharing datasets and for evaluating metrics for NLP, computer vision, and audio tasks.
A dataset can be loaded in a single line of code, in our case directly from the pandas DataFrame. At the same time, we split the dataset into a training (80%) and a test dataset (20%). We fix the random seed for the sake of reproducibility.
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=0)
Since the texts are relatively long, some parts of this notebook require substantial computing resources. Uncomment the following line to reduce the size of the dataset.
# dataset = DatasetDict({"train": dataset["train"].select(range(1000)), "test": dataset["test"].select(range(250))})
print(dataset)
DatasetDict({
train: Dataset({
features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
num_rows: 5559
})
test: Dataset({
features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
num_rows: 1390
})
})
The resulting DatasetDict behaves like a Python dictionary.
Therefore, you can access the Dataset corresponding to each split by
ds_train = dataset["train"]
print(ds_train)
Dataset({
features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
num_rows: 5559
})
The Dataset object behaves like a normal Python container.
You can query its length, get rows or columns, etc. For instance, its length is:
len(ds_train)
5559
To query a single row, you can use its index, like in a list: ds_train[0].
This returns a dictionary representing the row.
Its elements can be accessed by the column names as keys,
e.g. ds_train[0]["SCASEID"].
Multiple rows can be accessed by index slices, e.g. dataset["train"][:2],
or by a list of indices, e.g. dataset["train"][[0, 2]].
You can list the column names and get their detailed types (called features):
ds_train.features
{'level_0': Value(dtype='int64', id=None),
'index': Value(dtype='int64', id=None),
'SCASEID': Value(dtype='int64', id=None),
'SUMMARY_EN': Value(dtype='string', id=None),
'SUMMARY_GE': Value(dtype='string', id=None),
'INJSEVA': Value(dtype='int64', id=None),
'NUMTOTV': Value(dtype='int64', id=None),
'WEATHER1': Value(dtype='int64', id=None),
'WEATHER2': Value(dtype='int64', id=None),
'WEATHER3': Value(dtype='int64', id=None),
'WEATHER4': Value(dtype='int64', id=None),
'WEATHER5': Value(dtype='int64', id=None),
'WEATHER6': Value(dtype='int64', id=None),
'WEATHER7': Value(dtype='int64', id=None),
'WEATHER8': Value(dtype='int64', id=None),
'INJSEVB': Value(dtype='int64', id=None),
'words per case summary': Value(dtype='int64', id=None)}
Later in this tutorial we will get to know methods to process datasets, such as filtering the rows based on conditions, and processing the data in each row.
Next, we convert the summary texts into tokens, i.e., the text strings are split into elements of the vocabulary of the NLP model.
As such, the tokenizer and the NLP model need to be aligned. Changing the tokenizer after training the model would produce unpredictable results.
Let's start with the model
distilbert-base-multilingual-cased.
As the name implies, this model is cased: it distinguishes between "english" and "English".
The model is trained on the concatenation of Wikipedia in 104 different languages listed here. The model has 6 layers, 768 dimensions and 12 heads, totaling 134 million parameters. This model is a distilled version of the BERT base multilingual model, which has 177 million parameters. On average, the distilled model is twice as fast as the original model.
If you want to use another model throughout this notebook, please feel free to simply change the following line!
model_name = "distilbert-base-multilingual-cased"
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Tokenizer vocab_size: {tokenizer.vocab_size}")
print(f"Tokenizer model_max_length (maximum context size): {tokenizer.model_max_length}")
Tokenizer vocab_size: 119547
Tokenizer model_max_length (maximum context size): 512
As we can see, the tokenizer has a vocabulary of size 119'547. The maximum sequence length of the model is 512 tokens.
To see the tokenizer in action, we tokenize the first sentence of an accident description:
text = "V1, a 2000 Pontiac Montana minivan, made a left turn from a private driveway onto a northbound 5-lane two-way, dry asphalt roadway on a downhill grade."
result = tokenizer(text)
Calling the tokenizer returns a BatchEncoding object,
which behaves just like a standard Python dictionary and holds the input items used by the NLP model.
input_ids is the list of token IDs for each token.
attention_mask is a list containing 1 for all elements that correspond to tokens of the input text,
and 0 for padding tokens that are appended to attain a specified sequence length.
To illustrate the meaning of the input IDs, we convert them back to token strings:
print(result)
print(tokenizer.convert_ids_to_tokens(result["input_ids"]))
{'input_ids': [101, 159, 10759, 117, 169, 10180, 23986, 46917, 24408, 25103, 12955, 117, 11019, 169, 12153, 18923, 10188, 169, 14591, 23806, 14132, 31095, 169, 12756, 47755, 126, 118, 23636, 10551, 118, 13170, 117, 36796, 28438, 27015, 15485, 14132, 10135, 169, 12935, 32049, 21958, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'V', '##1', ',', 'a', '2000', 'Pont', '##iac', 'Montana', 'mini', '##van', ',', 'made', 'a', 'left', 'turn', 'from', 'a', 'private', 'drive', '##way', 'onto', 'a', 'north', '##bound', '5', '-', 'lane', 'two', '-', 'way', ',', 'dry', 'asp', '##halt', 'road', '##way', 'on', 'a', 'down', '##hill', 'grade', '.', '[SEP]']
We observe that words like "V1", "Pontiac", "minivan", "driveway" etc. are split into multiple tokens each.
This is typical of the WordPiece tokenization adopted by BERT, an approach designed to reduce vocabulary size.
This tokenizer marks sub-words by the prefix ##.
It is interesting to note that 2000 is a separate element of the vocabulary.
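The idea behind WordPiece can be illustrated with a toy greedy longest-match-first splitter (this is a simplified sketch with a made-up vocabulary, not the actual HuggingFace implementation):

```python
# Toy illustration of greedy WordPiece-style tokenization: repeatedly match
# the longest vocabulary piece; sub-words carry the ## prefix.
def wordpiece(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark continuation pieces
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:               # no piece matched at all
            return ["[UNK]"]
        start = end
    return tokens

vocab = {"drive", "##way", "down", "##hill"}
print(wordpiece("driveway", vocab))   # ['drive', '##way']
print(wordpiece("downhill", vocab))   # ['down', '##hill']
```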
The first and last tokens of the tokenized sequence are CLS and SEP, respectively.

CLS stands for "classification". The output of the BERT encoder corresponding to this input token is sometimes interpreted to represent the meaning of the entire sequence (we will check this in Section 3.2 of this notebook).
SEP stands for "separation". In next-sequence prediction tasks, it is used to separate the first from the second sequence.

Here is a list of other special tokens used by the BERT tokenizer:

The UNK token is used to represent tokens that are not available in the dictionary.
The PAD token is used to pad the length of the tokenized sequence to a fixed length. A fixed length is required when multiple sequences of different length are tokenized and fed into a BERT model at the same time.
The MASK token is used for pre-training the BERT model by masked language modeling. For this task, the model is used to predict the masked token.

print(f"Tokenizer special_tokens_map: {tokenizer.special_tokens_map}")
Tokenizer special_tokens_map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
It is instructive to look at the tokenization of the German translation of the same text:
text = "V1, ein Minivan der Marke Pontiac Montana aus dem Jahr 2000, bog von einer privaten Einfahrt nach links auf eine zweispurige, trockene Asphaltstraße mit 5 Fahrspuren in nördlicher Richtung und einem Gefälle ab."
result = tokenizer(text)
print(result)
print(tokenizer.convert_ids_to_tokens(result["input_ids"]))
{'input_ids': [101, 159, 10759, 117, 10290, 32930, 12955, 10118, 73879, 23986, 46917, 24408, 10441, 10268, 11218, 10180, 117, 66298, 10166, 10599, 73655, 12210, 25131, 10496, 23608, 10329, 10359, 11615, 54609, 13091, 10525, 117, 42169, 21181, 10112, 10882, 37590, 72847, 43968, 10221, 126, 44271, 16757, 54609, 30064, 10106, 28253, 10165, 20139, 10130, 10745, 144, 16822, 38064, 11357, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'V', '##1', ',', 'ein', 'Mini', '##van', 'der', 'Marke', 'Pont', '##iac', 'Montana', 'aus', 'dem', 'Jahr', '2000', ',', 'bog', 'von', 'einer', 'privaten', 'Ein', '##fahrt', 'nach', 'links', 'auf', 'eine', 'zwei', '##sp', '##uri', '##ge', ',', 'tro', '##cken', '##e', 'As', '##pha', '##lts', '##traße', 'mit', '5', 'Fa', '##hr', '##sp', '##uren', 'in', 'nördlich', '##er', 'Richtung', 'und', 'einem', 'G', '##ef', '##älle', 'ab', '.', '[SEP]']
Tokenizers of multi-lingual models use the same vocabulary for all languages. Obviously, the tokenizer simply splits the input string into pieces and does not perform any translation: the English pronoun "a" (169) is a different token than the equivalent German "ein" (10290).
We observe that the tokenizer is case-sensitive:
It differentiates between the tokens mini (25103) and Mini (32930).
So far, we have tokenized single sentences only.
Next, we want to tokenize the entire dataset.
This is easily achieved by applying the map function to the dataset.
All we need to provide to the map function is a function that takes a record or a batch of records from the dataset,
applies an operation to it, and returns a Dataset or a dict which defines the columns to be added or updated.
In our case, we supply a function that calls the tokenizer as shown before.
As we have seen, calling the tokenizer returns a dict with the keys input_ids and attention_mask.
Therefore, the map function will add columns with these names to the original dataset.
Since we plan to feed the tokenized sequences into a transformer model, we need to truncate their length to the maximum length accepted by the transformer. Moreover, the shorter sequences need to be padded at the end, so that all tokenized sequences have the same length.
Overall, only a few lines of code are required to complete the tokenization:
# define a function to tokenize a batch
def tokenize(batch, column):
return tokenizer(batch[column], truncation=True, padding=True)
# encode the full dataset
dataset_en = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_EN"})
print(dataset_en["train"].column_names)
['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary', 'input_ids', 'attention_mask']
The additional argument column is passed to tokenize via the dictionary fn_kwargs.
As we can see from the progress bars, the map function gets called twice - once for each split.
As expected, new columns input_ids and attention_mask have been added to the dataset.
We repeat the same procedure for the German texts.
dataset_ge = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_GE"})
Later on, we will also use a dataset which has 80% English texts and 20% German texts:
def map_mixed(x, idx):
return {"SUMMARY_MX" : x["SUMMARY_GE"] if idx % 5 == 0 else x["SUMMARY_EN"]}
dataset = dataset.map(map_mixed, batched=False, with_indices=True)
dataset_mx = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_MX"})
Now we have created three datasets - with the tokenized English, German and mixed language texts, respectively.
We could have stored the results in a single dataset (with different column names), but keeping the languages separate will make it easier to convince ourselves in the following examples that the languages have not been mixed up!
After completing the tokenization of the raw texts, we are ready to apply the transformer model, in our case the multilingual DistilBERT model.
First, we load the model. To speed up the following calculations, we opt for GPU support if available.
# load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model = AutoModel.from_pretrained(model_name).to(device)
Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias'] - This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The warning message can be ignored for our application.
Let's examine the model structure:
model
DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(119547, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(1): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(2): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(3): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(4): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(5): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
As we can see, the first block of the model deals with embeddings, with the word embedding as the first layer. This is followed by the transformer which consists of 6 transformer blocks.
Let's first explore the word embedding.
The goal of the word embedding layer is to assign each element of the vocabulary a vector of length $E$.
The multilingual DistilBERT model has a vocabulary of size $V=119'547$ and a word embedding size of $E=768$. We can confirm this by looking at the dimension of the word embedding weight tensor:
model.embeddings.word_embeddings
Embedding(119547, 768, padding_idx=0)
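What the embedding layer does can be sketched in plain NumPy with illustrative sizes (not the real DistilBERT weights): the lookup is simply row selection in a $V \times E$ weight matrix.

```python
import numpy as np

# Minimal sketch of an embedding lookup (toy sizes, random weights).
V, E = 10, 4                      # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
weight = rng.normal(size=(V, E))  # the trainable V x E weight matrix
input_ids = [3, 7, 3]             # token IDs as produced by a tokenizer
embedded = weight[input_ids]      # row lookup: one vector of length E per token
print(embedded.shape)             # (3, 4): sequence length x embedding size
```

Note that identical token IDs map to identical vectors; context only enters later, through the transformer blocks.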
To see the outputs of the transformer encoder, let's apply the transformer to the first record of the dataset,
more precisely to its columns input_ids and attention_mask, the outputs of the tokenizer:
example = dataset_en["train"][:1]
input_ids = torch.tensor(example["input_ids"]).to(device)
attention_mask = torch.tensor(example["attention_mask"]).to(device)
with torch.no_grad():
output = model(input_ids, attention_mask)
print(output)
BaseModelOutput(last_hidden_state=tensor([[[ 0.1148, -0.0254, 0.1447, ..., 0.1937, 0.0804, -0.2158],
[ 0.1216, -0.5199, 0.6924, ..., 0.2711, -0.2492, -0.0172],
[-0.4065, -0.0786, 0.3362, ..., -0.2183, 0.0278, 0.1635],
...,
[-0.1276, -0.4791, -0.1539, ..., 0.0442, -0.2272, 0.1089],
[-0.1577, -0.4097, -0.2176, ..., 0.0154, -0.2008, -0.1374],
[-0.1855, -0.4261, -0.1884, ..., -0.0515, -0.0600, -0.3426]]],
device='cuda:0'), hidden_states=None, attentions=None)
This produces a BaseModelOutput object which has a named property last_hidden_state,
a tensor that represents the hidden state of the final transformer block, i.e. the encoded text sequence!
The dimension of the last hidden state is:
print("dimensions of last hidden state: ", output.last_hidden_state.size())
dimensions of last hidden state: torch.Size([1, 512, 768])
i.e., [number of samples (1), sequence length $T$ (maximum 512 tokens), embedding size $E$ (768)].
In what follows, we will use the information contained in this tensor to make predictions.
In this section you will learn how transformers can be used to extract features from text data for a classification or regression problem.
The idea is simple: The tokenized raw text data is encoded by the transformer model, and the features are extracted from the last hidden state.
We have seen above that the DistilBERT model encodes each token of each input sample into a vector of length $E=768$. As such, the output of the transformer model depends on the length of the input sequence. To make predictions, we would prefer having a single vector per input sample, independent of the sequence length.
Different approaches are available to achieve this goal:
- Use the hidden state corresponding to the CLS token, which is the first token of the input sequence in BERT models.
- Apply mean pooling over the hidden states of the entire sequence; positions corresponding to the PAD token should be excluded because they don't carry any information.

We will implement both techniques and compare results.
In the following cell we display a short function which applies the NLP model to a batch of encoded input samples, extracts the last hidden state, and returns two tensors of length 768 for each input sample, corresponding to the two methods explained before.
The cell is not executable, because the function is already defined in the module tutorial_utils we imported initially.
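Since the displayed cell is not reproduced here, the following sketch shows the core pooling logic. The helper name `pool_hidden_states` is hypothetical; the actual `extract_sequence_encoding` in `tutorial_utils.py` additionally runs the model on the batch and converts the tensors to lists, but the two pooling operations (CLS extraction and PAD-excluding mean pooling) are the essential part:

```python
import torch

def pool_hidden_states(last_hidden_state, attention_mask):
    """Reduce a [batch, seq_len, emb] tensor to two [batch, emb] tensors."""
    # (a) CLS pooling: take the hidden state of the first token
    cls_hidden_state = last_hidden_state[:, 0]
    # (b) mean pooling: average over real tokens only; PAD positions
    #     (attention_mask == 0) are excluded from the average
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    mean_hidden_state = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return cls_hidden_state, mean_hidden_state
```

Both outputs have shape `[batch, 768]` for DistilBERT, regardless of the sequence length.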
Let's apply this function to the first sample of the training data:
example = dataset_en["train"][:1]
result = extract_sequence_encoding(example, model)
print(result.keys())
dict_keys(['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary', 'input_ids', 'attention_mask', 'cls_hidden_state', 'mean_hidden_state'])
As desired, two additional columns cls_hidden_state and mean_hidden_state were appended.
Therefore, the function can be supplied to the familiar map function
to add corresponding columns to the original dataset.
The following lines do this for the full datasets.
On an AWS EC2 p2.xlarge instance, the run time is more than 10 minutes. We save the resulting datasets to disk.
dataset_en = dataset_en.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_ge = dataset_ge.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_mx = dataset_mx.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_en.save_to_disk("./datasets/dataset_en")
dataset_ge.save_to_disk("./datasets/dataset_ge")
dataset_mx.save_to_disk("./datasets/dataset_mx")
We will now use the encoded texts as features to predict labels taken from certain tabular information available in the dataset.
To this end, we use the following convenience functions implemented in tutorial_utils.py:
x_train, y_train, x_test, y_test = get_xy(dataset, features, label)
get NumPy arrays of features (x) and labels (y) for the train and test splits of the dataset, where the encoded sentences are stored in the column features and the labels in the column label.
clf = logistic_regression_classifier(x, y, c=1)
fit and return a multinomial logistic regression classifier on features x and labels y. The L2 penalty is controlled by the hyper-parameter c.
clf = dummy_classifier(x, y):
fit and return a dummy classifier on features x and labels y. This classifier always predicts the most frequent class, and predict_proba always returns the empirical class distribution of y.
score_accuracy, score_log, score_brier, confusion_matrix, fig = evaluate_classifier(y_true, y_pred, p_pred, target_names, display_title_string, file_name)
Calculate and display performance metrics of a classifier. The return value fig is a plotly figure representing the confusion matrix plot. The following inputs are expected:
- y_true (array-like): the true labels;
- y_pred (array-like): the predicted labels, in which case the log loss and Brier score are not evaluated;
- p_pred (array-like): the predicted class probabilities; whichever of y_pred and p_pred is not used should be set to None.

Now the toolbox is ready!
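These helpers are defined in tutorial_utils.py and not shown here. As an illustration only, the two classifiers might be implemented with scikit-learn roughly as follows; this is an assumed sketch, not the actual tutorial code:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def logistic_regression_classifier(x, y, c=1):
    # logistic regression with L2 penalty; in scikit-learn, C is the *inverse*
    # regularization strength (larger c = weaker penalty); with the default
    # lbfgs solver, multi-class problems are fitted as a multinomial model
    return LogisticRegression(C=c, max_iter=1000).fit(x, y)

def dummy_classifier(x, y):
    # "prior" strategy: always predicts the most frequent class, and
    # predict_proba returns the empirical class distribution of y
    return DummyClassifier(strategy="prior").fit(x, y)
```

The dummy classifier serves as a baseline: any useful model should beat its scores.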
Next, we apply it to a simple classification task.
In this case study, we will predict the number of vehicles involved in an accident from the verbal accident description.
Since the data set contains the column NUMTOTV, we can adopt a supervised learning approach.
We might consider framing the problem as a regression task, e.g. using Poisson regression. However, looking at the frequency distribution of NUMTOTV, it appears unlikely that the Poisson distribution is a good reflection of reality. First, there are no accidents with zero vehicles involved - it takes at least one. So we might consider using a zero-truncated Poisson model. However, the empirical frequency distribution has low mass at high vehicle counts, so this would not be a plausible model either.
Therefore, we frame the prediction task as multinomial classification. Given that only a small fraction of cases involves four or more vehicles, and to avoid a heavily imbalanced classification problem, we map these cases to an aggregated class "3+".
To achieve this, we map the column NUMTOTV to a new column labels, with levels 0 (1 vehicle), 1 (2 vehicles) and 2 (3 or more vehicles).
We choose the column name labels because this is expected by the sequence classification model which we fit in Section 4.2.
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")
# map number of vehicles to a new column "labels"
labels = ["1", "2", "3+"]
d = {i: min(i-1, 2) for i in range(1,10)}
dataset_en = dataset_en.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_ge = dataset_ge.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_mx = dataset_mx.map(lambda x: {"labels": d[x["NUMTOTV"]]})
print(dataset_en["train"]["NUMTOTV"][:40])
print(dataset_en["train"]["labels"][:40])
[2, 1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 1, 3, 4, 1, 3, 1, 2, 1, 2, 2, 4, 2, 2, 2, 4, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2] [1, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 1, 0, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
As explained in Section 3.1, we will explore two different ways to use encoded texts:
- Use the hidden state corresponding to the CLS token, which is the first token of the input sequence in BERT models.
- Apply mean pooling over the hidden states of the entire sequence.

Let's start with the first approach by using the feature cls_hidden_state produced in Section 3.1.
Using the toolbox developed before we fit a dummy classifier and a logistic regression classifier to the features and labels of the English dataset.
# extract the transformer encoding corresponding to the CLS token
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "cls_hidden_state", "labels")
# fit dummy classifier
clf_dummy = dummy_classifier(x_train_en, y_train_en)
_ = evaluate_classifier(y_test_en, None, clf_dummy.predict_proba(x_test_en), labels, "Dummy classifier", "cm_nv_dummy")
Dummy classifier
accuracy score = 57.2%, log loss = 0.961, Brier loss = 0.574
classification report
precision recall f1-score support
1 0.00 0.00 0.00 389
2 0.57 1.00 0.73 795
3+ 0.00 0.00 0.00 206
accuracy 0.57 1390
macro avg 0.19 0.33 0.24 1390
weighted avg 0.33 0.57 0.42 1390
# fit a classifier to the encoded English texts
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression (a)", "cm_nv_lr_a")
Logistic regression (a)
accuracy score = 90.9%, log loss = 0.275, Brier loss = 0.146
classification report
precision recall f1-score support
1 0.94 0.93 0.93 389
2 0.89 0.96 0.92 795
3+ 0.92 0.68 0.78 206
accuracy 0.91 1390
macro avg 0.92 0.85 0.88 1390
weighted avg 0.91 0.91 0.91 1390
We obtain an accuracy score of 91%, compared to 57% with the dummy classifier. This is already a very good result!
Remember, we have just used the DistilBERT transformer off the shelf, with no tuning whatsoever, to extract a vector of length 768 representing the information contained in the accident descriptions. During this entire text encoding, the transformer model was unaware that its output was going to be used to predict the number of vehicles.
How about the second approach, which uses the feature mean_hidden_state that was extracted
by mean pooling over the entire encoded sequence?
Let's see:
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression (b), train EN, test EN", "cm_nv_EN_EN")
Logistic regression (b), train EN, test EN
accuracy score = 96.0%, log loss = 0.127, Brier loss = 0.063
classification report
precision recall f1-score support
1 0.96 0.97 0.97 389
2 0.95 0.98 0.97 795
3+ 0.99 0.86 0.92 206
accuracy 0.96 1390
macro avg 0.97 0.94 0.95 1390
weighted avg 0.96 0.96 0.96 1390
Again, we have used DistilBERT without any fine-tuning.
For the present task, by any of the considered scores, mean pooling performs much better than using the encoding of the CLS token.
For this reason, we use mean pooling in what follows.
What would you guess - will the classifier model exhibit a similar performance when trained on the encoded German dataset?
Let's check:
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
clf_ge = logistic_regression_classifier(x_train_ge, y_train_ge, c=10)
_, _, _, _, _ = evaluate_classifier(y_test_ge, None, clf_ge.predict_proba(x_test_ge), labels, "train GE, test GE", "cm_nv_GE_GE")
train GE, test GE
accuracy score = 96.0%, log loss = 0.120, Brier loss = 0.062
classification report
precision recall f1-score support
1 0.97 0.98 0.97 389
2 0.95 0.98 0.97 795
3+ 0.96 0.86 0.91 206
accuracy 0.96 1390
macro avg 0.96 0.94 0.95 1390
weighted avg 0.96 0.96 0.96 1390
Yes indeed, the performance on the English and German datasets is comparable. This is what we would have expected - after all, we are using a multilingual transformer model.
In practice, it might happen that training data is available (predominantly) in one language, but we would like to apply the model to test data in another language. Translating the test data to the language of the training data would be an option, but let's see how the multilingual transformer model performs.
In our small experiment, we simply switch the languages of the test sets. This might be hard for the models, since in the entire training process each model has seen only encoded input from text samples in one language!
First, use the German test set for the model trained on English input:
_ = evaluate_classifier(y_test_ge, None, clf_en.predict_proba(x_test_ge), labels, "train EN, test GE", "cm_nv_EN_GE")
train EN, test GE
accuracy score = 66.0%, log loss = 1.083, Brier loss = 0.527
classification report
precision recall f1-score support
1 1.00 0.16 0.27 389
2 0.67 0.86 0.75 795
3+ 0.57 0.85 0.68 206
accuracy 0.66 1390
macro avg 0.75 0.62 0.57 1390
weighted avg 0.75 0.66 0.61 1390
From these rather poor results, we conclude that this approach to cross-language transferability does not work.
Vice versa, use the English test set for the model based on German input:
_ = evaluate_classifier(y_test_en, None, clf_ge.predict_proba(x_test_en), labels, "train GE, test EN", "cm_nv_GE_EN")
train GE, test EN
accuracy score = 24.3%, log loss = 8.053, Brier loss = 1.361
classification report
precision recall f1-score support
1 0.00 0.00 0.00 389
2 0.40 0.17 0.24 795
3+ 0.19 0.99 0.32 206
accuracy 0.24 1390
macro avg 0.20 0.39 0.19 1390
weighted avg 0.26 0.24 0.18 1390
Again, performance is unsatisfactory.
To improve results, we need to change the approach.
In a multilingual situation, a possible approach is to train the classifier on a training set consisting of encoded samples from both languages. This can always be achieved by translating a fraction of the text data and then using it to train the model.
This is exactly what we are going to do next.
In order to simulate a situation where one language is underrepresented, we create a mixed-language dataset
with about 80% English and 20% German samples, our dataset dataset_mx produced in Section 2.2.
Since we are already using a multilingual transformer model, no further changes are required.
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")
clf_mx = logistic_regression_classifier(x_train_mx, y_train_mx, c=10)
_ = evaluate_classifier(y_test_en, None, clf_mx.predict_proba(x_test_en), labels, "train EN/GE, test EN", "cm_nv_MX_EN")
_ = evaluate_classifier(y_test_ge, None, clf_mx.predict_proba(x_test_ge), labels, "train EN/GE, test GE", "cm_nv_MX_GE")
train EN/GE, test EN
accuracy score = 95.7%, log loss = 0.136, Brier loss = 0.068
classification report
precision recall f1-score support
1 0.96 0.98 0.97 389
2 0.95 0.97 0.96 795
3+ 0.97 0.85 0.90 206
accuracy 0.96 1390
macro avg 0.96 0.93 0.95 1390
weighted avg 0.96 0.96 0.96 1390
train EN/GE, test GE
accuracy score = 95.2%, log loss = 0.160, Brier loss = 0.080
classification report
precision recall f1-score support
1 0.96 0.97 0.97 389
2 0.95 0.97 0.96 795
3+ 0.94 0.85 0.90 206
accuracy 0.95 1390
macro avg 0.95 0.93 0.94 1390
weighted avg 0.95 0.95 0.95 1390
This is a very good outcome. The scores are close to those achieved in the single-language situation!
To conclude, a multi-lingual situation can be handled by a multi-lingual transformer model. For the best performance, the classifier should be trained on the encoded sequences from all languages.
In the previous case study, we have used the DistilBERT model without any adaptation to the text data at hand, simply by using the sequence encoding produced by the model. As such, the language representation, which the model has learned from a large corpus of multilingual data, is transferred to the text data at hand. This approach is called transfer learning. The advantage of transfer learning is that a powerful (but relatively complex) model can be trained on a large corpus of data, using large-scale computing power, and then be applied to situations where availability of data or computing power would not allow for such complex models.
For the task at hand, the results are already very good. However, in certain situations it might be required to further improve model performance.
In the following sections you will learn how to fine-tune a transformer model. We will explore two approaches to fine-tuning:

- domain-specific fine-tuning, which adapts the language model to the text corpus at hand; and
- task-specific fine-tuning, which trains the model directly on the downstream task.
The advantage of the first approach is that it can be performed in an unsupervised fashion, i.e., it does not require labeled data.
On the other hand, task-specific fine-tuning is expected to produce better performance on the particular task which the model was tuned for, so it might be the method of choice if there is a single down-stream task and sufficient labeled data.
Let's explore these two fine-tuning approaches in turn.
Domain-specific fine-tuning can be achieved by applying the model to a "masked language modeling" task. This involves taking a sentence, randomly masking a certain percentage of the words in the input, and then running the entire masked sentence through the model, which has to predict the masked words. This self-supervised approach generates inputs and labels automatically from the texts and does not require any human labelling.
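To make the masking procedure concrete, here is a simplified toy sketch of how MLM inputs and labels can be generated from a token sequence. This is not the actual HuggingFace implementation: DataCollatorForLanguageModeling operates on token ids rather than strings, and it replaces some selected positions with random tokens or leaves them unchanged instead of always masking:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    # self-supervised MLM input generation: hide a fraction p of the tokens;
    # the model is trained to reconstruct them from the surrounding context
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            masked.append(mask_token)
            labels.append(tok)    # prediction target at masked positions
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels
```

Because inputs and labels are derived mechanically from the raw text, any corpus can be used for this fine-tuning step.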
This is very easy to implement using the Transformers library. You will see three new elements of the Transformer library in action:
- The AutoModelForMaskedLM class loads the DistilBERT model with a model head suitable for the masked language modeling task.
- The DataCollatorForLanguageModeling class forms training batches from the dataset and handles the masking.
- The Trainer class provides the interface to train the model.

Depending on the hardware available, training might take a rather long time. Therefore, if available, we use GPU support. On an AWS EC2 p2.xlarge instance, the run time is about 55 minutes. We store the trained model for later use.
If you do not have enough time to perform this step right now, you can skip this section and return later. The remainder of this notebook does not depend on it.
# load model and tokenizer and define the DataCollator
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model_mlm = AutoModelForMaskedLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
dataset_mx = load_from_disk("./datasets/dataset_mx")
# define training arguments
training_args = TrainingArguments(
output_dir="models/" + model_name + "_mlm_epochs",
overwrite_output_dir=True,
num_train_epochs=2,
per_device_train_batch_size=4,
save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(
model=model_mlm,
args=training_args,
data_collator=data_collator,
train_dataset=dataset_mx["train"]
)
trainer.train()
trainer.save_model("models/" + model_name + "_mlm")
The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: WEATHER7, WEATHER4, SUMMARY_GE, cls_hidden_state, WEATHER1, WEATHER6, SUMMARY_MX, INJSEVA, SCASEID, mean_hidden_state, NUMTOTV, WEATHER8, SUMMARY_EN, WEATHER3, index, words per case summary, WEATHER2, INJSEVB, level_0, WEATHER5. If these columns are not expected by `DistilBertForMaskedLM.forward`, you can safely ignore this message.
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
***** Running training *****
Num examples = 5559
Num Epochs = 2
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 2780
| Step | Training Loss |
|---|---|
| 500 | 1.414500 |
| 1000 | 1.119100 |
| 1500 | 1.020200 |
| 2000 | 0.938100 |
| 2500 | 0.877800 |
Training completed. Do not forget to share your model on huggingface.co/models =)
Saving model checkpoint to models/distilbert-base-multilingual-cased_mlm
Configuration saved in models/distilbert-base-multilingual-cased_mlm/config.json
Model weights saved in models/distilbert-base-multilingual-cased_mlm/pytorch_model.bin
Now, model_mlm holds the DistilBERT model, fine-tuned to the mixed-language accident descriptions
using masked-language-modeling.
Next, we apply this model to all input sequences and extract the last hidden state. The procedure is the same as in section 3.1. To avoid confusion, we create new datasets, and store them on disk for later use, so that this step does not need to be repeated all over when this notebook is re-run.
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("models/" + model_name + "_mlm").to(device)
dataset_en_pretrained = dataset_en.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_ge_pretrained = dataset_ge.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_mx_pretrained = dataset_mx.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_en_pretrained.save_to_disk("./datasets/dataset_en_pretrained")
dataset_ge_pretrained.save_to_disk("./datasets/dataset_ge_pretrained")
dataset_mx_pretrained.save_to_disk("./datasets/dataset_mx_pretrained")
loading configuration file models/distilbert-base-multilingual-cased_mlm/config.json
Model config DistilBertConfig {
"_name_or_path": "models/distilbert-base-multilingual-cased_mlm",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"torch_dtype": "float32",
"transformers_version": "4.19.2",
"vocab_size": 119547
}
loading weights file models/distilbert-base-multilingual-cased_mlm/pytorch_model.bin
Some weights of the model checkpoint at models/distilbert-base-multilingual-cased_mlm were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of DistilBertModel were initialized from the model checkpoint at models/distilbert-base-multilingual-cased_mlm.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertModel for predictions without further training.
dataset_en_pretrained = load_from_disk("./datasets/dataset_en_pretrained")
dataset_ge_pretrained = load_from_disk("./datasets/dataset_ge_pretrained")
dataset_mx_pretrained = load_from_disk("./datasets/dataset_mx_pretrained")
# map number of vehicles to a new column "labels"
labels = ["1", "2", "3+"]
d = {i: min(i-1, 2) for i in range(1,10)}
dataset_en = dataset_en_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_ge = dataset_ge_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_mx = dataset_mx_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})
# extract features and labels and create multi-lingual dataset
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")
# fit logistic regression classifiers to each of the three datasets and (cross-) evaluate them
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "train EN, test EN", "cm_nv_pr_EN_EN")
_ = evaluate_classifier(y_test_ge, None, clf_en.predict_proba(x_test_ge), labels, "train EN, test GE", "cm_nv_pr_EN_GE")
train EN, test EN
accuracy score = 97.1%, log loss = 0.091, Brier loss = 0.044
classification report
precision recall f1-score support
1 0.96 0.99 0.97 389
2 0.97 0.98 0.97 795
3+ 0.98 0.91 0.94 206
accuracy 0.97 1390
macro avg 0.97 0.96 0.96 1390
weighted avg 0.97 0.97 0.97 1390
train EN, test GE
accuracy score = 41.4%, log loss = 1.801, Brier loss = 0.890
classification report
precision recall f1-score support
1 1.00 0.08 0.16 389
2 0.58 0.43 0.49 795
3+ 0.26 0.99 0.42 206
accuracy 0.41 1390
macro avg 0.61 0.50 0.35 1390
weighted avg 0.65 0.41 0.39 1390
clf_ge = logistic_regression_classifier(x_train_ge, y_train_ge, c=10)
_ = evaluate_classifier(y_test_ge, None, clf_ge.predict_proba(x_test_ge), labels, "train GE, test GE", "cm_nv_pr_GE_GE")
_ = evaluate_classifier(y_test_en, None, clf_ge.predict_proba(x_test_en), labels, "train GE, test EN", "cm_nv_pr_GE_EN")
train GE, test GE
accuracy score = 96.9%, log loss = 0.104, Brier loss = 0.051
classification report
precision recall f1-score support
1 0.97 0.98 0.98 389
2 0.97 0.98 0.97 795
3+ 0.96 0.91 0.94 206
accuracy 0.97 1390
macro avg 0.97 0.96 0.96 1390
weighted avg 0.97 0.97 0.97 1390
train GE, test EN
accuracy score = 64.3%, log loss = 3.640, Brier loss = 0.687
classification report
precision recall f1-score support
1 0.00 0.00 0.00 389
2 0.62 1.00 0.76 795
3+ 0.99 0.49 0.65 206
accuracy 0.64 1390
macro avg 0.54 0.49 0.47 1390
weighted avg 0.50 0.64 0.53 1390
clf_mx = logistic_regression_classifier(x_train_mx, y_train_mx, c=10)
_ = evaluate_classifier(y_test_en, None, clf_mx.predict_proba(x_test_en), labels, "train EN/GE, test EN", "cm_nv_pr_MX_EN")
_ = evaluate_classifier(y_test_ge, None, clf_mx.predict_proba(x_test_ge), labels, "train EN/GE, test GE", "cm_nv_pr_MX_GE")
train EN/GE, test EN
accuracy score = 97.1%, log loss = 0.095, Brier loss = 0.046
classification report
precision recall f1-score support
1 0.97 0.99 0.98 389
2 0.97 0.98 0.98 795
3+ 0.98 0.90 0.94 206
accuracy 0.97 1390
macro avg 0.97 0.96 0.96 1390
weighted avg 0.97 0.97 0.97 1390
train EN/GE, test GE
accuracy score = 96.3%, log loss = 0.133, Brier loss = 0.063
classification report
precision recall f1-score support
1 0.97 0.97 0.97 389
2 0.96 0.97 0.97 795
3+ 0.94 0.90 0.92 206
accuracy 0.96 1390
macro avg 0.96 0.95 0.95 1390
weighted avg 0.96 0.96 0.96 1390
Comparing with the results above, we observe that domain-specific fine-tuning has improved the scores, but not to a satisfactory level in the cross-language transfer cases.
An alternative to domain-specific fine-tuning is task-specific fine-tuning.
The idea is to train a transformer model directly on the task at hand, in our case a sequence classification task.
The process is very similar to the masked language modeling used for domain-specific pre-training, except that
we load a sequence classification model using the class AutoModelForSequenceClassification.
The following code tunes a sequence classification model that uses the English accident descriptions to predict the number of vehicles involved. On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model_cls = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
f1 = f1_score(labels, preds, average="weighted")
acc = accuracy_score(labels, preds)
return {"accuracy": acc, "f1": f1}
# train the model
batch_size = 8
logging_steps = len(dataset_en["train"]) // batch_size
training_args = TrainingArguments(
output_dir="models/" + model_name + "_nv_epochs",
num_train_epochs=2,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
metric_for_best_model="f1",
logging_steps=logging_steps,
save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(model=model_cls, args=training_args,
compute_metrics=compute_metrics, train_dataset=dataset_en["train"],
eval_dataset=dataset_en["test"])
trainer.train();
trainer.save_model("models/" + model_name + "_nv")
loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
"_name_or_path": "distilbert-base-multilingual-cased",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.19.2",
"vocab_size": 119547
}
loading weights file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/pytorch_model.bin from cache at /home/ubuntu/.cache/huggingface/transformers/7b48683e2e7ba71cd1d7d6551ac325eceee01db5c2f3e81cfbfd1ee7bb7877f2.c24097b0cf91dbc66977325325fd03112f0f13d0e3579abbffc8d1e45f8d0619
Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: WEATHER7, WEATHER4, SUMMARY_GE, cls_hidden_state, WEATHER1, WEATHER6, INJSEVA, SCASEID, mean_hidden_state, NUMTOTV, WEATHER8, SUMMARY_EN, WEATHER3, index, words per case summary, WEATHER2, INJSEVB, level_0, WEATHER5. If these columns are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning:
This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
***** Running training *****
Num examples = 5559
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 1390
| Step | Training Loss |
|---|---|
| 694 | 0.319600 |
| 1388 | 0.079000 |
Training completed. Do not forget to share your model on huggingface.co/models =)
Saving model checkpoint to models/distilbert-base-multilingual-cased_nv
Configuration saved in models/distilbert-base-multilingual-cased_nv/config.json
Model weights saved in models/distilbert-base-multilingual-cased_nv/pytorch_model.bin
# evaluate model performance using predictions on the English test set
predictions_en = trainer.predict(dataset_en["test"])
_ = evaluate_classifier(predictions_en.label_ids, None, softmax(predictions_en.predictions, axis=1), labels, "train EN, test EN", "cm_nv_tsk_EN_EN")
The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: WEATHER7, WEATHER4, SUMMARY_GE, cls_hidden_state, WEATHER1, WEATHER6, INJSEVA, SCASEID, mean_hidden_state, NUMTOTV, WEATHER8, SUMMARY_EN, WEATHER3, index, words per case summary, WEATHER2, INJSEVB, level_0, WEATHER5. If these columns are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
***** Running Prediction *****
Num examples = 1390
Batch size = 8
train EN, test EN
accuracy score = 99.4%, log loss = 0.032, Brier loss = 0.012
classification report
precision recall f1-score support
1 0.99 1.00 0.99 389
2 1.00 0.99 0.99 795
3+ 0.99 0.99 0.99 206
accuracy 0.99 1390
macro avg 0.99 0.99 0.99 1390
weighted avg 0.99 0.99 0.99 1390
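For reference, the three scores reported above can be recomputed from the predicted probabilities. The sketch below uses a hypothetical helper `classification_scores`; the notebook's own `evaluate_classifier` is defined in `tutorial_utils.py`, and its exact Brier definition may differ by a constant factor.

```python
import numpy as np

def classification_scores(y_true, proba, classes):
    """Accuracy, log loss and multi-class Brier loss from predicted
    class probabilities (columns of proba ordered as in classes)."""
    proba = np.asarray(proba, dtype=float)
    # one-hot encode the true labels in the order given by classes
    onehot = (np.asarray(y_true)[:, None] == np.asarray(classes)[None, :]).astype(float)
    accuracy = np.mean(np.argmax(proba, axis=1) == np.argmax(onehot, axis=1))
    log_loss = -np.mean(np.sum(onehot * np.log(np.clip(proba, 1e-15, 1.0)), axis=1))
    brier = np.mean(np.sum((proba - onehot) ** 2, axis=1))
    return accuracy, log_loss, brier
```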
# evaluate model performance using predictions on the German test set (cross-lingual test)
predictions_ge = trainer.predict(dataset_ge["test"])
_ = evaluate_classifier(predictions_ge.label_ids, None, softmax(predictions_ge.predictions, axis=1), labels, "train EN, test GE", "cm_nv_task_EN_GE")
***** Running Prediction *****
Num examples = 1390
Batch size = 8
train EN, test GE
accuracy score = 98.9%, log loss = 0.046, Brier loss = 0.019
classification report
precision recall f1-score support
1 0.99 1.00 0.99 389
2 0.99 0.99 0.99 795
3+ 0.97 0.97 0.97 206
accuracy 0.99 1390
macro avg 0.98 0.99 0.99 1390
weighted avg 0.99 0.99 0.99 1390
The scores on the English test set have improved substantially.
Even more impressive is the cross-lingual transfer performance: although the model has been trained on English texts only, its scores on the German test set are very good.
This is an excellent result!
As seen in the previous section, predicting the number of vehicles from the available accident descriptions is a relatively easy task for the transformer model, even in a multi-lingual situation.
Therefore, we now turn to a somewhat more difficult task: identifying cases which lead to bodily injuries. We use the column INJSEVB as the label.
The process is identical to the previous case study:
In case you have skipped Section 4.1 Domain-specific fine-tuning, the dataset ./datasets/dataset_en_pretrained will not be available.
In this case, simply comment out the last line of each of the blocks below.
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")
dataset_pr = load_from_disk("./datasets/dataset_en_pretrained")
# map injuries
labels = ["0", "1"]
dataset_en = dataset_en.rename_column("INJSEVB", "labels")
dataset_ge = dataset_ge.rename_column("INJSEVB", "labels")
dataset_mx = dataset_mx.rename_column("INJSEVB", "labels")
dataset_pr = dataset_pr.rename_column("INJSEVB", "labels")
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")
x_train_pr, y_train_pr, x_test_pr, y_test_pr = get_xy(dataset_pr, "mean_hidden_state", "labels")
# fit dummy classifier
clf_dummy = dummy_classifier(x_train_en, y_train_en)
_ = evaluate_classifier(y_test_en, None, clf_dummy.predict_proba(x_test_en), labels, "Dummy classifier", "cm_inj_dummy")
Dummy classifier
accuracy score = 58.7%, log loss = 0.679, Brier loss = 0.486
classification report
precision recall f1-score support
0 0.59 1.00 0.74 816
1 0.00 0.00 0.00 574
accuracy 0.59 1390
macro avg 0.29 0.50 0.37 1390
weighted avg 0.34 0.59 0.43 1390
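As a sanity check: the dummy classifier always predicts the majority class, so its accuracy equals the share of class 0 in the test set (using the support counts from the report above):

```python
# support counts from the classification report above
support_0, support_1 = 816, 574

accuracy = support_0 / (support_0 + support_1)
print(f"majority-class accuracy = {accuracy:.1%}")  # 58.7%
```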
# fit logistic regression classifier to the encoded English texts (by the original DistilBERT model)
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression, DistilBERT", "cm_inj_lr")
Logistic regression, DistilBERT
accuracy score = 80.1%, log loss = 0.400, Brier loss = 0.259
classification report
precision recall f1-score support
0 0.83 0.83 0.83 816
1 0.76 0.75 0.76 574
accuracy 0.80 1390
macro avg 0.79 0.79 0.79 1390
weighted avg 0.80 0.80 0.80 1390
In case you have skipped Section 4.1 Domain-specific finetuning, please also skip the following cell.
# fit logistic regression classifier to the encoded English texts (by the fine-tuned DistilBERT model)
clf_pr = logistic_regression_classifier(x_train_pr, y_train_pr, c=10)
_ = evaluate_classifier(y_test_pr, None, clf_pr.predict_proba(x_test_pr), labels, "Logistic regression - 2 epochs pre-training", "cm_inj_pr")
Logistic regression - 2 epochs pre-training
accuracy score = 82.7%, log loss = 0.375, Brier loss = 0.238
classification report
precision recall f1-score support
0 0.85 0.86 0.85 816
1 0.79 0.79 0.79 574
accuracy 0.83 1390
macro avg 0.82 0.82 0.82 1390
weighted avg 0.83 0.83 0.83 1390
We observe the following:
The scores on class 0 are better than on class 1 because of a large number of false positives.
Next, we perform task-specific fine-tuning. On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model_cls_inj = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)
batch_size = 8
logging_steps = len(dataset_en["train"]) // batch_size
training_args = TrainingArguments(
output_dir="models/" + model_name + "_inj_epochs",
num_train_epochs=2,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
metric_for_best_model="f1",
disable_tqdm=False,
logging_steps=logging_steps,
save_strategy=trainer_utils.IntervalStrategy.NO,
)
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
f1 = f1_score(labels, preds, average="weighted")
acc = accuracy_score(labels, preds)
return {"accuracy": acc, "f1": f1}
trainer = Trainer(model=model_cls_inj, args=training_args,
compute_metrics=compute_metrics, train_dataset=dataset_en["train"], eval_dataset=dataset_en["test"])
trainer.train();
trainer.save_model("models/" + model_name + "_inj")
loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
"_name_or_path": "distilbert-base-multilingual-cased",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.19.2",
"vocab_size": 119547
}
loading weights file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/pytorch_model.bin from cache at /home/ubuntu/.cache/huggingface/transformers/7b48683e2e7ba71cd1d7d6551ac325eceee01db5c2f3e81cfbfd1ee7bb7877f2.c24097b0cf91dbc66977325325fd03112f0f13d0e3579abbffc8d1e45f8d0619
Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
Num examples = 5559
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 1390
| Step | Training Loss |
|---|---|
| 694 | 0.537200 |
| 1388 | 0.316700 |
Training completed. Do not forget to share your model on huggingface.co/models =)
Saving model checkpoint to models/distilbert-base-multilingual-cased_inj
Configuration saved in models/distilbert-base-multilingual-cased_inj/config.json
Model weights saved in models/distilbert-base-multilingual-cased_inj/pytorch_model.bin
# Execute the following line to load the trained model from disk.
# trainer = Trainer(AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj", num_labels=len(labels)).to(torch.device("cuda" if torch.cuda.is_available() else "cpu")))
# evaluate model performance using predictions on the English test set
predictions_en = trainer.predict(dataset_en["test"])
_ = evaluate_classifier(predictions_en.label_ids, None, softmax(predictions_en.predictions, axis=1), labels,
"DistilBERT classifier - 2 epochs task-specific", "cm_inj_tsk")
***** Running Prediction *****
Num examples = 1390
Batch size = 8
DistilBERT classifier - 2 epochs task-specific
accuracy score = 90.0%, log loss = 0.266, Brier loss = 0.154
classification report
precision recall f1-score support
0 0.91 0.92 0.92 816
1 0.88 0.87 0.88 574
accuracy 0.90 1390
macro avg 0.90 0.90 0.90 1390
weighted avg 0.90 0.90 0.90 1390
We observe that the task-specific fine-tuned model performs considerably better than the logistic regression classifiers of the previous section.
To investigate the prediction errors, we export the predictions into an Excel file with the following columns:
| column | meaning |
|---|---|
| SCASEID | unique identification number of the case |
| SUMMARY_EN | description of the accident, in English |
| SUMMARY_TRUNCATED | description of the accident, in English, truncated to a length of 512 tokens |
| INJSEVA | most serious injury sustained in the case, as per Police Accident Report |
| labels | indicator of bodily injury INJSEVB (true label) |
| pred | predicted label |
| 0 | probability of the negative label |
| 1 | probability of the positive label |
# export prediction results for error analysis
dataset_en.set_format(type="pandas")
df_res = pd.concat([dataset_en["test"].to_pandas(),
pd.DataFrame(data=softmax(predictions_en.predictions, axis=1), columns=["0", "1"]),
pd.DataFrame(data=np.argmax(predictions_en.predictions, -1).reshape((-1,1)), columns=['pred'])
], axis=1)
df_res = df_res[["SCASEID", "SUMMARY_EN", "INJSEVA", "labels", "pred", "0", "1"]]
dataset_en.set_format()
for i in range(df_res.shape[0]):
df_res.loc[i, "SUMMARY_TRUNCATED"] = tokenizer.convert_tokens_to_string(tokenizer.tokenize(df_res.loc[i, "SUMMARY_EN"], truncation=True))
df_res.to_excel("./results/error_analysis_inj.xlsx")
The first step of the error analysis is to inspect the samples producing false negative and false positive predictions. Reading every single text would be very tedious, therefore it is worthwhile focusing on those examples where the probability assigned to the false prediction was high, i.e., cases where the model was confident but wrong.
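This ranking can be done directly on the exported results, for example with a small helper (a sketch; confident_errors is a hypothetical function, and the column names are those of df_res above):

```python
import pandas as pd

def confident_errors(df, top=10):
    """Return the most confidently wrong cases: false negatives sorted by
    the probability assigned to class 0, and false positives sorted by the
    probability assigned to class 1 (columns as in df_res above)."""
    fn = df[(df["labels"] == 1) & (df["pred"] == 0)].sort_values("0", ascending=False)
    fp = df[(df["labels"] == 0) & (df["pred"] == 1)].sort_values("1", ascending=False)
    return fn.head(top), fp.head(top)
```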
Looking at the false negatives, we observe that there are many cases where the model assigns a high probability to negative. We suspect that truncation is responsible for many of the false negatives – the relevant part of the text was discarded.
To address this issue, we split the text into slightly overlapping chunks,
run the prediction on each chunk and apply the logical OR-function to the results.
We implement this functionality in a simple function that returns an additional column preds,
containing a list of predicted labels, with one element for each chunk.
def predict_with_overflow(x, model, feature):
t = tokenizer(x[feature], truncation=True, padding=True, return_overflowing_tokens=True)
input_ids = torch.tensor(t["input_ids"]).to(model.device)
attention_mask = torch.tensor(t["attention_mask"]).to(model.device)
with torch.no_grad():
preds = np.argmax(model(input_ids, attention_mask).logits.cpu(), -1)
return {"preds": preds}
# Execute the following lines to load the trained model and the tokenizer from disk.
# model_cls_inj = AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj", num_labels=len(labels)).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
# tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset_en_overflow = dataset_en["test"].map(predict_with_overflow, batched=False, fn_kwargs={"model": model_cls_inj, "feature": "SUMMARY_EN"})
dataset_en_overflow = dataset_en_overflow.map(lambda x: {"pred": max(x["preds"])})
_ = evaluate_classifier(predictions_en.label_ids, dataset_en_overflow["pred"], None, labels,
"DistilBERT classifier - split inputs", "cm_inj_split")
DistilBERT classifier - split inputs
accuracy score = 93.3%, log loss = nan, Brier loss = nan
classification report
precision recall f1-score support
0 0.97 0.91 0.94 816
1 0.89 0.96 0.92 574
accuracy 0.93 1390
macro avg 0.93 0.94 0.93 1390
weighted avg 0.94 0.93 0.93 1390
dataset_en_overflow.set_format(type="pandas")
df_res = dataset_en_overflow.to_pandas()
df_res = df_res[["SCASEID", "SUMMARY_EN", "INJSEVA", "labels", "pred"]]
dataset_en.set_format()
for i in range(df_res.shape[0]):
df_res.loc[i, "SUMMARY_TRUNCATED"] = tokenizer.convert_tokens_to_string(tokenizer.tokenize(df_res.loc[i, "SUMMARY_EN"], truncation=True))
df_res.to_excel("./results/error_analysis_inj_overflow.xlsx")
The number of false negatives has decreased significantly, as expected, and the accuracy score has improved. Since we have not implemented a logic to combine the predicted probabilities of the different chunks, the log loss and Brier loss cannot be evaluated in this case (hence the nan values above).
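If combined probabilities were needed, one simple option would be a noisy-OR rule over the per-chunk probabilities of the positive class. This is a sketch under the assumption that the chunks act as independent pieces of evidence; it is not implemented in the notebook.

```python
import numpy as np

def combine_chunk_probs(chunk_probs):
    """Noisy-OR combination: the case is predicted positive if any chunk
    is positive, so the combined probability is one minus the product of
    the per-chunk negative probabilities."""
    p = np.asarray(chunk_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)

# e.g. three chunks with positive-class probabilities 0.1, 0.8 and 0.2
# combine to 1 - 0.9 * 0.2 * 0.8 = 0.856
```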
Use Captum and transformers-interpret to Interpret Predictions
Transformer models are quite complex, and therefore interpreting model output can be difficult.
Our main interest is in knowing which parts of the input text cause the classifier to arrive at a particular prediction. One way to answer this question is the so-called integrated gradients method. It is conveniently available through the library transformers_interpret, which provides an interface to Captum, an open-source, extensible library for model interpretability built on PyTorch.
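To give an intuition for what the integrated-gradients method computes, here is a minimal numerical sketch for a differentiable scalar function (not the Captum implementation; n_steps plays the same role as the argument passed to cls_explainer below):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, n_steps=20):
    """Approximate integrated gradients: average the gradient of the model
    output along a straight path from a baseline input to the actual input,
    then scale by the input difference. The path integral is approximated
    with a midpoint Riemann sum over n_steps points."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps        # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # points along the path
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

# sanity check: for f(x) = sum(x**2), the gradient is 2x, and with a zero
# baseline the attributions sum (approximately) to f(x) - f(baseline)
```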
With just a few lines of code, we can run this on individual examples, and receive a graphical output as shown below. Of course, the output is also available in numerical form. We run this on CPU because on the AWS p2.xlarge instance, the GPU ran out of memory.
device = torch.device("cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = model_cls_inj.to(device)
cls_explainer = SequenceClassificationExplainer(model, tokenizer)
loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/vocab.txt from cache at /home/ubuntu/.cache/huggingface/transformers/28e5b750bf4f39cc620367720e105de1501cf36ec4ca7029eba82c1d2cc47caf.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/tokenizer.json from cache at /home/ubuntu/.cache/huggingface/transformers/5cbdf121f196be5f1016cb102b197b0c34009e1e658f513515f2eebef9f38093.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/tokenizer_config.json from cache at /home/ubuntu/.cache/huggingface/transformers/47087d99feeb3bc6184d7576ff089c52f7fbe3219fe48c6c4fa681e617753256.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
# true positive
s = tokenizer.decode(dataset_en["test"][144]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_144.html");
| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| [CLS] This three - vehicle crash occurred in the morning of a weekend on a multi - lane highway near an entrance ra ##mp . The highway runs east and west and divided by a high - tension cable guard ##rail . The bit ##umi ##nou ##s road ##way is dry , level and curve ##d to the left at the location of this crash . The posted speed limit 89 km ##ph ( 65 mph ) and there were no ad ##verse weather conditions . V ##1 , a 2006 Je ##ep Liberty with two occupa ##nts , was west ##bound in lane three inte ##nding to go straight . V ##2 , a 1992 Mitsubishi Dia ##mante with one occupa ##nt , was west ##bound in lane four inte ##nding to go straight . V ##3 , a 1996 Nissan pick ##up with one occupa ##nt , was west ##bound in lane one ( ac ##cel ##eration ra ##mp ) inte ##nding to merge left . An unknown vehicle traveling behind V ##3 switched lane ##s and cut in front of V ##1 . V ##1 attempted to avoid this unknown vehicle by changing lane ##s and striking V ##2 ( event # 1 ) . Subsequently , V ##1 and V ##2 sp ##un across all travel lane ##s and departed the right side of the road . V ##1 was struck in the right side by V ##3 as it sp ##un across the ac ##cel ##eration lane and came to final rest on the right roads ##ide . After V ##2 entered the right roads ##ide it sp ##un into an em ##bank ##ment and rolle ##d ( est . 6 - quarter turns ) and came to final rest on its roof . V ##3 drove off the right side of the road after striking V ##1 . The driver of V ##1 is a 45 - year - old female that refused to be interviewed . She was not injured in the crash and her Je ##ep was driven from the scene . The Critical Pre ##cra ##sh Event for V ##1 was code ##d this vehicle traveling over the lane line on the left side of the travel lane . The Critical Reason for the Critical Event was code ##d in ##corre ##ct eva ##sive action . 
Other factors code ##d to this driver include chose ina ##pp ##rop ##riate eva ##sive action and poor direction ##al control ( failure to control vehicle with skill ord ##inar ##ily expected ) . The driver of V ##2 is a 40 - year - old female that was not interviewed because of a language barrier ( Korean . ) She was transported to the hospital and her vehicle was to ##wed due to damage . The Critical Pre ##cra ##sh Event was code ##d other vehicle en ##cro ##aching from adjacent lane - over right lane line . The Critical Reason for the Critical Event was not code ##d to this vehicle . The driver [SEP] | ||||
# true positive
s = tokenizer.decode(dataset_en["test"][18]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_18.html");
| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| [CLS] This crash occurred in the south ##bound lane of a two - lane und ##ivi ##ded road ##way . This was a level asp ##halt road that curve ##d slightly to the left , with a posted speed limit of 64 km ##ph ( 40 mph ) . It was early in the evening on a week ##day , conditions were clear , and the road ##way was dry . There were no traffic flow restrictions . V ##1 was a 2002 Chrysler Se ##bring 2 - door convert ##ible . The vehicle was traveling south ##bound and its driver was beginning to nego ##tia ##te a left curve . V ##1 departed the road ##way to the right and struck a telephone pole located on the roads ##ide . V ##1 rota ##ted clock ##wise after the impact and then trip ##ped over its wheels . V ##1 rolle ##d two quarter - turns and came to final rest on its roof . V ##1 was driven by a 69 - year old female who suffered moderate injuries . The driver has since been put into a nur ##sing home and does not reca ##ll any information from the accident . The accident report and medical records indicated that the driver of V ##1 had a blood alcohol content of 0 . 177 . The Critical Pre - crash Event for V ##1 was this vehicle traveling off the edge of the road on the right side . The Critical Reason for the Critical Pre - crash Event was poor direction ##al control , a driver - related factor . Associated factors code ##d to the driver of V ##1 include alcohol use , the medical condition of diabetes and the use of pre ##scription med ##ication to control the diabetes . Medical reports also indicated that the driver of V ##1 had a history of alcohol ##ism . 
[SEP] [PAD] … [PAD] [SEP] | ||||
# false negative: "leaving an injured passenger" overlooked
s = tokenizer.decode(dataset_en["test"][331]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_331.html");
| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| [CLS] This two vehicle crash occurred late in the evening on a two - lane up ##hill bit ##umi ##nou ##s road ##way , with no traffic controls and a speed limit of 56 km ##ph ( 30 mph ) . Vehicle one ( V ##1 ) was a 2007 Ford e ##cono ##line van driven by a thirty four ( 34 ) year - old male who takes no med ##ication or has any vision restrictions . V ##1 was traveling south in lane one going straight . Vehicle two ( V ##2 ) was a 1994 Honda Civic sedan driven by an unknown aged driver with one passenger . V ##2 was traveling south in lane one . According to a witness V ##2 was traveling at a high rate of speed and attempting to pass V ##1 on the right when the front of V ##2 struck the rear of V ##1 . The driver of V ##2 fled the scene on foot , leaving an injured passenger . Both vehicle ' s came to final rest facing south . V ##2 was to ##wed from the scene . The passenger of V ##2 did not know the driver and refused to speak about the crash due to his illegal status in this country . The critical pre - crash event for V ##1 was code ##d : other motor vehicle in lane , traveling in same direction with higher speed . The critical reason for the critical event was not code ##d to this vehicle . The driver of V ##1 was traveling from one job site to another when V ##1 was rear - ended by V ##2 . He was going straight traveling at the posted speed limit in this residential area and observed V ##2 approach ##ing from the rear in his side mirror . The critical pre - crash event for V ##2 was code ##d : other motor vehicle in lane , traveling in same direction with lower st ##eady speed . The critical reason for the critical event was code ##d to the driver of V ##2 as a driver related factor : poor direction ##al control ( e . g . , failing to control vehicle with skill ord ##inar ##ily expected ) . An associated factor for V ##2 was excessive speed and mis ##jud ##gment of gap . V ##2 ' s left front tire was the wrong size and all tire ##s had low tre ##ad depth . 
[SEP] [PAD] … [PAD] [SEP] | ||||
# false positive:
s = tokenizer.decode(dataset_en["test"][78]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_78.html");
| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| [CLS] The crash occurred on a north / south four - lane highway with shoulder ##s . It curve ##d to the east ( right ) as it traveled north ##ward with a radius of curva ##ture of 274 meters and a positive 4 % grade . Initially there was a grass median div ##iding the north and south lane ##s but as the highway traveled north the median ended with only a double yellow line separat ##ing the directions of travel . A two - lane side street inter ##sect ##ed on the west side of the highway and traveled southeast . Con ##ditions were dark and dry on a week ##day evening . Vehicle # 1 was a 1987 Mercury Marquis traveling north ##bound on the highway . The driver , apparently confused , attempted to turn left on the side street 29 meters prior to the intersection . The vehicle went down a steep 62 % em ##bank ##ment , striking the ground at the bottom of the em ##bank ##ment with its front . It came to rest facing south with its rear wheels just on the edge of the pave ##d south shoulder and was to ##wed due to damage . Vehicle # 1 was driven by a 54 - year old female that was un ##belt ##ed and not transported to a medical facility . Two adult passengers and an 8 - month child in a safety seat were also not injured . The driver stated she went out the wrong exit from a gas station on the east side of the highway a few hundred meters south of the crash . She intended to turn left on the side street to circle back around and enter a shopping center that was located across the highway from the gas station . App ##aren ##tly she thought that the street sign identify ##ing the side streets name was on the north side of the intersection as opposed to south and initiated the left turn 29 meters before the inter ##sect ##ing pave ##ment began . She said that once she started to turn and realized the error she attempted to brak ##e but the front wheels had left the pave ##ment and the em ##bank ##ment was so steep she could not recover . 
In ##vesti ##gating tro ##oper ##s agree with researcher that poor vision could have contributed to the scenario and required her to follow up with a vision rete ##sting at a state driver ' s license center . The Critical Pre ##cra ##sh Event for Vehicle # 1 was this vehicle traveling off the edge of the road on the left side . The Critical Reason for the Critical Event was code ##d other recognition error , attempted left turn too early . Associated factors included con ##versi ##ng with passenger and poor direction ##al control ( failure to control vehicle with skill ord ##inar ##ily expected ) . A vehicle view ob ##stru ##ction - related to other was included due [SEP] | ||||
# false positive:
s = tokenizer.decode(dataset_en["test"][915]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_915.html");
| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| [CLS] This crash occurred on a straight level bit ##umi ##nou ##s two lane road ##way that was divided by a painted median . The posted speed limit of 72 km ##ph ( 45 mph ) which reduce ##s to 56 km ##ph ( 35 mph ) 100 meters after the crash site . There is a sign indicating the road ##way narrow ##s . The weather was cloud ##y and the road ##way was partially wet . Traffic flow was normal for that time of day . This crash occurred on a week ##day afternoon . Vehicle 1 , a 2002 Nissan Alt ##ima , was traveling behind Vehicle 2 , a 1991 Chevrolet Lu ##mina , when it drove into the safety zone into the on ##coming traffic lane in order to illegal ##ly pass Vehicle 2 . V ##1 returned to its original lane and impact ##ed with V ##2 ' s front left , with its right rear quarter panel . This sp ##un V ##1 in a clock ##wise position 180 degrees , with V ##1 coming to final rest after impact ##ing an em ##bank ##ment on the right side of the road ##way , with its rear left . Vehicle 1 was to ##wed due to damage . V ##1 came to final rest off the road ##way facing in a northeast ##erly direction . V ##2 came to final rest on the road ##way facing in a south ##erly direction . V ##1 was to ##wed due to damage . V ##2 was to ##wed due to its driver going to the hospital with her baby . Vehicle # 1 , the Nissan Alt ##ima , was driven by a belt ##ed 38 - year - old male who refused to be interviewed . He stated he did not want to be both ##ered " with this sh - t " . The Critical Pre ##cra ##sh Event code ##d to Vehicle 1 was : Other - this vehicle traveling entering the road ##way from the left side of the road ##way . The Critical Reason for the Critical Pre ##cra ##sh Event was code ##d as : driver related factor , aggressive driving behavior . Vehicle # 2 , the Chevrolet , was driven by a belt ##ed 21 year - old female who was not injured . There was a belt ##ed 18 year - old male in the front right seat who was not injured . 
There was a 6 - month - old female child in a car seat in the second row . The child was taken to the hospital for a check out , accompanied by both other people in the vehicle . This driver stated to her relative that she had seen the driver of V ##1 making " wild ge ##stu ##res " and tail ##gating her . She stated she saw V ##1 coming around her on the left but could only brak ##e before impact . The Critical Pre ##cra ##sh Event code [SEP] | ||||
In this section we use extractive question answering to extract parts of the accident description which indicate the presence of bodily injury. The aim is to reduce the length of the input texts by extracting only the relevant parts.
The easiest implementation of extractive question answering is provided by the pipeline abstraction.
We use deutsche-telekom/bert-multi-english-german-squad2, a multilingual English/German question answering model built on bert-base-multilingual-cased. Specifying device=0 runs the pipeline on the first GPU.
model_name_qa = "deutsche-telekom/bert-multi-english-german-squad2"
pl = pipeline("question-answering", model=model_name_qa, tokenizer=model_name_qa, device=0)
questions = ["Was someone injured?", "Was someone transported?"]
loading configuration file https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/98815a531e6b412916e105532c140400a6e221e5d249dbc2652fc3bbbc02bb03.063bf511b0ec1ed4ac464b049fce380c9d6f729f38e5413cc3fa45026ec0a0de
Model config BertConfig {
"_name_or_path": "deutsche-telekom/bert-multi-english-german-squad2",
"architectures": [
"BertForQuestionAnswering"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"directionality": "bidi",
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"position_embedding_type": "absolute",
"transformers_version": "4.19.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 119547
}
loading weights file https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2/resolve/main/pytorch_model.bin from cache
All model checkpoint weights were used when initializing BertForQuestionAnswering.
All the weights of BertForQuestionAnswering were initialized from the model checkpoint at deutsche-telekom/bert-multi-english-german-squad2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForQuestionAnswering for predictions without further training.
loading file https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2/resolve/main/vocab.txt from cache
loading file https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2/resolve/main/special_tokens_map.json from cache
loading file https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2/resolve/main/tokenizer_config.json from cache
We visit each accident report in turn (the context), and ask the model the two questions “Was someone injured?” and “Was someone transported?”. Since the accident reports might provide information on multiple persons, we allow a maximum of four candidate answers for each of the questions, which we concatenate into a single (much shorter) new text.
To achieve this, we write a short function which applies a question answering pipeline to an input text x.
The argument questions is a list of questions.
def get_answers(x, qa_pipeline, questions):
    x["INJ"] = ""
    for question in questions:
        # retrieve up to four candidate answers; handle_impossible_answer allows
        # an empty answer when the context does not address the question
        res = qa_pipeline(context=x["SUMMARY_EN"], question=question, top_k=4, handle_impossible_answer=True)
        # a single answer is returned as a dict rather than a list of dicts
        if isinstance(res, dict):
            res = [res]
        if len(res[0]) > 0:
            # append the candidate answers to the extract
            x["INJ"] = '. '.join([x["INJ"]] + [item["answer"] for item in res])
    return x
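The concatenation logic can be exercised in isolation by substituting a stub for the real question answering pipeline. The stub below and its fixed answers are made up for illustration; it merely mimics the list-of-dicts return shape of the Hugging Face pipeline.

```python
# Hypothetical stand-in for the Hugging Face question-answering pipeline:
# returns a fixed list of candidate-answer dicts, like the real pipeline does.
def stub_qa_pipeline(context, question, top_k, handle_impossible_answer):
    if "injured" in question:
        return [{"answer": "The driver"}, {"answer": "not transported"}]
    return [{"answer": ""}]  # empty answer = "no answer found"

def get_answers(x, qa_pipeline, questions):
    x["INJ"] = ""
    for question in questions:
        res = qa_pipeline(context=x["SUMMARY_EN"], question=question,
                          top_k=4, handle_impossible_answer=True)
        if isinstance(res, dict):
            res = [res]
        if len(res[0]) > 0:
            x["INJ"] = '. '.join([x["INJ"]] + [item["answer"] for item in res])
    return x

x = {"SUMMARY_EN": "The driver was injured and not transported."}
x = get_answers(x, stub_qa_pipeline, ["Was someone injured?", "Was someone transported?"])
print(x["INJ"])  # candidate answers joined by '. '
```

This makes the behaviour explicit: each question contributes its candidate answers to the growing INJ string, and questions without an answer contribute nothing of substance.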
We apply the question answering function to the entire test set.
On an AWS EC2 p2.xlarge instance, the run time is about 6 minutes. If you want to try the concept on only the first 250 samples, you can use ds_test = dataset["test"].select(range(250)).map(...
ds_test = dataset["test"].map(get_answers, batched=False, fn_kwargs={"qa_pipeline": pl, "questions": questions})
Next, we tokenize the extracted texts and define the labels, and store the dataset for later use:
ds_test = ds_test.map(tokenize, batched=True, fn_kwargs={"column": "INJ"})
ds_test = ds_test.rename_column("INJSEVB", "labels")
ds_test.save_to_disk("./datasets/ds_test")
We load the transformer model that was trained on the classification task...
#ds_test = load_from_disk("./datasets/ds_test")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj").to(device)
trainer = Trainer(model)
loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
"_name_or_path": "distilbert-base-multilingual-cased",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.19.2",
"vocab_size": 119547
}
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/vocab.txt from cache at /home/ubuntu/.cache/huggingface/transformers/28e5b750bf4f39cc620367720e105de1501cf36ec4ca7029eba82c1d2cc47caf.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/tokenizer.json from cache at /home/ubuntu/.cache/huggingface/transformers/5cbdf121f196be5f1016cb102b197b0c34009e1e658f513515f2eebef9f38093.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/tokenizer_config.json from cache at /home/ubuntu/.cache/huggingface/transformers/47087d99feeb3bc6184d7576ff089c52f7fbe3219fe48c6c4fa681e617753256.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading configuration file models/distilbert-base-multilingual-cased_inj/config.json
Model config DistilBertConfig {
"_name_or_path": "models/distilbert-base-multilingual-cased_inj",
"activation": "gelu",
"architectures": [
"DistilBertForSequenceClassification"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"problem_type": "single_label_classification",
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"torch_dtype": "float32",
"transformers_version": "4.19.2",
"vocab_size": 119547
}
loading weights file models/distilbert-base-multilingual-cased_inj/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at models/distilbert-base-multilingual-cased_inj.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
...apply it to the tokenized text extracts and evaluate the predictions.
predictions = trainer.predict(ds_test)
_ = evaluate_classifier(predictions.label_ids, None, softmax(predictions.predictions, axis=1), ["0", "1"], "Extractive QA", "cm_inj_qa")
The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: SUMMARY_MX, words per case summary, WEATHER7, WEATHER2, WEATHER4, NUMTOTV, SUMMARY_GE, WEATHER8, INJSEVA, SCASEID, SUMMARY_EN, index, WEATHER3, level_0, INJ, WEATHER5, WEATHER1, WEATHER6. If these are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1390
  Batch size = 8
Extractive QA
accuracy score = 85.5%, log loss = 0.408, Brier loss = 0.243
classification report
precision recall f1-score support
0 0.85 0.91 0.88 816
1 0.85 0.78 0.82 574
accuracy 0.85 1390
macro avg 0.85 0.84 0.85 1390
weighted avg 0.85 0.85 0.85 1390
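Note that predictions.predictions holds the raw logits returned by the model; the softmax call converts each row into class probabilities before evaluation. A small numeric sketch (with made-up logits) shows what that step does:

```python
import numpy as np
from scipy.special import softmax

# Two hypothetical logit rows, one per test example
logits = np.array([[2.0, -1.0],    # model favours class 0
                   [-0.5, 1.5]])   # model favours class 1

probs = softmax(logits, axis=1)    # row-wise: exp(z) / sum(exp(z))
print(probs.round(3))              # each row now sums to 1
pred_labels = probs.argmax(axis=1)
print(pred_labels)                 # -> [0 1]
```

The probabilities (rather than hard labels) are what the log loss and Brier loss above are computed from.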
The performance is comparable with the logistic regression classifier on mean-pooled encodings of the original texts. On the other hand, there are more false negatives than obtained by task-specific training and evaluation on the full-length sequences. This indicates that in some cases the extractive question answering missed or suppressed relevant parts. For instance, if the original text reads “The driver was injured.”, the extract “The driver” is a correct answer to the question “Was someone injured?”; however, it is too short for the classifier to detect the presence of an injury from the extract.
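One possible mitigation (a sketch of an idea, not part of the pipeline above) is to prefix each extracted answer with the question it answers, so that even a very short span like “The driver” keeps its injury context in the concatenated text. The helper below is hypothetical:

```python
# Hypothetical helper: pair each answer with its question before concatenation,
# so that short spans retain the context they were extracted for.
def contextualise(question_answer_pairs):
    parts = [f"{q} {a}" for q, a in question_answer_pairs if a]  # drop empty answers
    return ". ".join(parts)

extract = contextualise([
    ("Was someone injured?", "The driver"),
    ("Was someone transported?", ""),   # empty answer = "no answer"
])
print(extract)  # -> Was someone injured? The driver
```

Whether this actually reduces the false negatives would need to be verified by re-running the classifier on the contextualised extracts.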
Congratulations!
In this notebook, you have learned how to apply transformer-based models to classification tasks that often arise in actuarial applications.
You have seen how to address challenges that often arise in practical applications:
a. The text corpus may be highly domain-specific, i.e., it may use specialized terminology. – In Section 4.1 we have applied domain-specific fine-tuning to improve model performance in a specific domain.
b. Multiple languages might be present in parallel. – In Section 3.5 we have used a multilingual transformer model to encode multilingual texts and used the output for a classification task. Performance was good even when one language was underrepresented.
c. Text sequences might be short and ambiguous. Or they might be so long that it is hard to identify the parts relevant to the task. – In this tutorial we have demonstrated two approaches to deal with long texts:
In Section 5.2 we have split long input texts into slightly overlapping chunks and applied the classifier to each chunk separately.
In Section 6 we have used extractive question answering to extract parts of the original texts which are relevant to the task.
d. The amount of training data may be relatively small. In particular, gathering large amounts of labelled data (i.e., text sequences augmented with a target label) might be expensive. – Throughout this notebook, we have used transformer models which have been pre-trained on a large corpus of text data. We have applied these models to the specific task with little or no task-specific training, thus transferring their language understanding skills to the task at hand.
e. It is important to understand why a model arrives at a particular prediction. – In Section 5.3 we have shown how to visualize which parts of the input text cause the classifier to arrive at a particular prediction.
The notebook Part II deals with another dataset that has only short text descriptions. It demonstrates possible approaches in case no or few labels are available.